In this notebook, we're going to implement linear regression from scratch. We'll focus on the 1-dimensional case, where we use a single feature to estimate the value of a related variable. Once you're comfortable with the single-variable case, try our notebook on linear regression with multiple features!
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
n = 100 #Number of observations in the training set
#Assign True parameters to be estimated
trueIntercept = 10
trueGradient = -4
x = np.random.uniform(0, 100, n)
y = trueIntercept + trueGradient*x + np.random.normal(loc=0, scale=40, size = n) #This is the true relationship that we're trying to model
data = pd.DataFrame({'X':x, 'Y':y}) #Put the two arrays into a dataframe to make it easy to work with
data.head(10) #Inspect the first ten rows of our dataset
plt.scatter(data['X'], data['Y'])
plt.show()
There appears to be a fairly linear relationship between X and Y in this dataset, so linear regression is appropriate without any further feature engineering.
To fit a linear regression model to data $(x_1, y_1), (x_2, y_2),..., (x_n, y_n)$ such that $y_i = \alpha + \beta x_i + \epsilon_i$, where $\epsilon_i$ is the ith error term, the Least-Squares estimates for the parameters $\alpha, \beta$ are:
$$\hat \beta = \frac{\sum^{n}_{i=1}(x_i - \bar x)(y_i - \bar y)}{\sum^{n}_{i=1}(x_i - \bar x)^2}$$

$$\hat \alpha = \bar y - \hat \beta \bar x $$

Once we've calculated $\hat \alpha$ and $\hat \beta$, we can estimate the label corresponding to a newly observed feature, $x^*$, as:

$$ \hat y = \hat \alpha + \hat \beta \times x^* $$

Where do these estimates come from? The short answer is: Multivariate Calculus! There are lots of existing resources online which do a good job of explaining how we derive the equations for linear regression - check out this link if you want to read more.
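Before we wrap these formulas in a class, here's a minimal sketch (separate from the notebook's main workflow) of the closed-form estimates on a small toy array, compared against np.polyfit as a sanity check. The toy values and variable names here are purely illustrative.

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0]) #Toy feature values, just for illustration
ys = np.array([2.1, 3.9, 6.2, 7.8, 10.1]) #Toy labels
xBar, yBar = xs.mean(), ys.mean()
#Closed-form least-squares estimates from the formulas above
betaHatToy = np.sum((xs - xBar)*(ys - yBar)) / np.sum((xs - xBar)**2)
alphaHatToy = yBar - betaHatToy*xBar
#np.polyfit with degree 1 should give (almost) identical values
slope, intercept = np.polyfit(xs, ys, 1)
print(betaHatToy, alphaHatToy) #Roughly 1.99 and 0.05 for this toy data
print(slope, intercept)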
class LinearRegression1D:

    def __init__(self, data, target, feature, trainTestRatio = 0.9):
        #data - a pandas DataFrame
        #target - the name of the pandas column which contains the true labels
        #feature - the name of the pandas column which we will use to do the regression
        #trainTestRatio - the proportion of the entire dataset which we'll use for training
        #                 - the rest will be used for testing
        self.target = target
        self.feature = feature

        #Split up data into a training and testing set
        self.train, self.test = train_test_split(data, test_size=1-trainTestRatio)

    def fitLR(self):
        #Fit a linear regression to the training data
        #The goal is to estimate the gradient of the line of best fit and the intercept of that line

        #Calculate the estimate for the gradient
        #If possible try to do it in one line by doing pointwise operations on the arrays
        xBar = np.mean(self.train[self.feature])
        yBar = np.mean(self.train[self.target])
        self.betaHat = (np.sum((self.train[self.feature] - xBar)*(self.train[self.target] - yBar))
                        / np.sum((self.train[self.feature] - xBar)**2))

        #Use the value of self.betaHat we've just calculated to calculate self.alphaHat, the estimated intercept
        self.alphaHat = yBar - self.betaHat*xBar

        return 0 #We've saved the parameter values as part of the class now

    def predict(self, x):
        #Given a vector of new observations x, predict the corresponding target values
        prediction = self.alphaHat + self.betaHat*x
        return prediction
myModel = LinearRegression1D(data, 'Y', 'X')
myModel.fitLR() #If this returns a zero, then it should have finished. If not, we've got problems!
We can print the parameters we estimated for the gradient and intercept. Recall that the true values are saved as trueIntercept and trueGradient.
print(f'The true intercept value was {trueIntercept}, we estimated the value to be {myModel.alphaHat}')
print(f'The true gradient value was {trueGradient}, we estimated the value to be {myModel.betaHat}')
Not too bad! We didn't have a particularly large dataset to train on, so being in the right ballpark with the parameters is perfectly acceptable given the variability within the data. A larger dataset would most likely give a more accurate estimate of the two values.
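If you'd like to see this effect for yourself, here's a quick sketch (an extra step, not part of the original workflow) that refits the model on progressively larger synthetic samples drawn from the same true relationship. The sample sizes are arbitrary choices; estimates should tend to drift closer to trueIntercept and trueGradient as n grows.

for nObs in [100, 1000, 10000]: #Arbitrary sample sizes, chosen for illustration
    xBig = np.random.uniform(0, 100, nObs)
    yBig = trueIntercept + trueGradient*xBig + np.random.normal(loc=0, scale=40, size=nObs) #Same true relationship and noise scale as before
    bigModel = LinearRegression1D(pd.DataFrame({'X': xBig, 'Y': yBig}), 'Y', 'X')
    bigModel.fitLR()
    print(f'n = {nObs}: intercept estimate {bigModel.alphaHat:.2f}, gradient estimate {bigModel.betaHat:.2f}')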
x = np.arange(np.floor(data['X'].min()), np.ceil(data['X'].max()))
plt.scatter(data['X'], data['Y'], label = 'Data')
plt.plot(x, myModel.alphaHat + x*myModel.betaHat, label = 'Regression line')
plt.legend()
plt.show()
We can see that our regression line runs through the centre of the point cloud, indicating that the model fits nicely.
testPred = myModel.predict(myModel.test[myModel.feature]) #What our models thinks the true values of y are, given x
residuals = testPred - myModel.test[myModel.target]
plt.scatter(range(residuals.shape[0]), residuals)
plt.ylabel('Residuals')
plt.xlabel('Index')
plt.show()
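As a complementary numerical check (again, an extra step rather than part of the original notebook), we can summarise the test-set residuals with their mean, root-mean-squared error and an R-squared value. If the model is well specified, the mean residual should be close to zero, the RMSE should be close to the noise standard deviation we used to simulate the data (40), and R-squared should be reasonably high.

meanResidual = np.mean(residuals) #Should be close to zero if the fit is unbiased
rmse = np.sqrt(np.mean(residuals**2)) #Should be close to the simulated noise scale of 40
ssRes = np.sum(residuals**2)
ssTot = np.sum((myModel.test[myModel.target] - np.mean(myModel.test[myModel.target]))**2)
rSquared = 1 - ssRes/ssTot #Proportion of test-set variance explained by the model
print(f'Mean residual: {meanResidual:.2f}')
print(f'Test RMSE: {rmse:.2f}')
print(f'Test R-squared: {rSquared:.3f}')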